Micro-optimizations on modern CPUs
Pipelining increases instruction throughput.
Execution is superscalar and out-of-order.
Latency: time (in cycles) before the next dependent instruction can be issued.
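A minimal sketch of why latency matters for dependent instructions, assuming a plain float sum as the workload. The function names `sum_single` and `sum_unrolled4` are made up for illustration. In the first loop every add waits for the previous one; in the second, four independent accumulators give the out-of-order core several adds to overlap.

```c
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
   so the loop is limited by the latency of the floating-point add. */
float sum_single(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent accumulators (hypothetical unroll factor): the four
   dependency chains do not depend on each other and can overlap. */
float sum_unrolled4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)  /* remaining 0..3 elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the second version adds the elements in a different order, so the rounding of the result can differ from the first. This is why compilers generally do not perform this transformation on floating-point code unless told that reassociation is allowed.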
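Roofline plot: attainable performance (FLOP/s) plotted against arithmetic intensity (AI), i.e. floating-point operations per unit of memory traffic.
Premise: AI is a property of the algorithm and independent of the hardware (not true!)
Rooflines depend on the implementation of the algorithm!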
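For reference, the roofline bound itself can be sketched as a one-line function: attainable performance is the smaller of the peak compute rate and bandwidth times AI. The parameter names below are assumptions for illustration, and AI is taken in FLOP per byte here.

```c
/* Attainable performance under the roofline model.
   peak_flops:    peak compute rate in FLOP/s   (hypothetical parameter)
   mem_bandwidth: memory bandwidth in bytes/s   (hypothetical parameter)
   ai:            arithmetic intensity in FLOP/byte */
double roofline_attainable(double peak_flops, double mem_bandwidth, double ai) {
    double memory_bound = mem_bandwidth * ai;  /* sloped part of the roof */
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}
```

With made-up numbers, a peak of 100 FLOP/s and a bandwidth of 10 bytes/s, a kernel with AI = 2 is memory-bound at 20 FLOP/s, while one with AI = 50 hits the flat compute roof at 100 FLOP/s.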
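#include <stddef.h>

#define N 1024  /* matrix dimension; value assumed here, not given in the notes */

/* C += A * B for row-major N x N matrices.
   C must be zero-initialized by the caller to get the plain product. */
void multiply(float *A, float *B, float *C) {
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j) {
            for (size_t k = 0; k < N; ++k) {
                // very naughty memory access pattern:
                // B is read with stride N (column-wise)
                C[i*N+j] += A[i*N+k] * B[k*N+j];
            }
        }
    }
}

Neglecting integer operations, each inner iteration performs 2 floating-point operations (one multiply, one add) and touches 3 floats (one each of A, B and C), so we have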
\[ \mathrm{AI} = \frac{2}{3} \]
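To illustrate how the implementation changes the picture, here is a sketch of the same product with the j and k loops interchanged. The name `multiply_ikj` and the explicit size parameter `n` are assumptions for this example, not part of the code above. The inner loop now walks B and C with unit stride, and the element of A is loaded once per inner loop rather than once per iteration.

```c
#include <stddef.h>

/* Hypothetical loop-interchanged variant: C += A * B for row-major
   n x n matrices. C must be zero-initialized by the caller. */
void multiply_ikj(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];  /* reused across the whole j loop */
            for (size_t j = 0; j < n; ++j) {
                /* B and C are both accessed contiguously in j */
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

Counted the same way as before, the inner iteration still performs 2 floating-point operations but touches only 2 floats (B and C), so the per-iteration ratio is no longer 2/3 although the algorithm is unchanged.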